Raw Data

Questions¶

Which factors determine the success of an NBA team?¶

Initially, we were interested in how payroll caps and luxury taxes had an impact on baseball teams. The data for the MLB, however, was difficult to obtain, so we explored other professional leagues. Eventually, we settled on the NBA and asked a similar question: what factors are critical to the success of an NBA team?

Factors that we investigated include:

  • finances
  • team marketability
  • fan engagement
  • geographic location
  • age of players

We sought to determine which, if any, of these variables best predict a team's success, measured in wins.

Data Sources and Clean-up Process¶

  • basketball-reference

  • RunRepeat

  • HoopsHype

  • Hoop Social

Pertinent data were pulled from these sources, compiled in Excel, and exported as a CSV file. We chose a five-year span (2017-2021) because it is long enough to capture broader trends in the NBA.

Analysis¶

In [1]:
# import libraries

import csv
import pandas as pd
import string
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import requests
import json
import hvplot.pandas
import seaborn as sns
import numpy as np

from scipy.stats import linregress
import holoviews as hv
hv.extension('bokeh')
from bokeh.models import ColumnDataSource, Legend
from bokeh.plotting import figure, show
from bokeh.transform import dodge
from bokeh.io import export_png
from bokeh.io import export_svgs
from math import pi

from config import geoapify_key

Import Main Dataframe¶

In [2]:
# Import NBA franchise data from 2017-2021 and replace NaN with zeros
file_path = 'Resources/NBA-Complete.csv'
NBA_df = pd.read_csv(file_path)
NBA_df = NBA_df.fillna(0)
In [3]:
# Display df
NBA_df.head()
Out[3]:
                Team  Age (2017)  W (2017)  L (2017)  Pct (2017)  Attend. (2017)  Income (2017)  All Stars (2017)  Age (2018)  W (2018)  ...  Attend. (2020)  Income (2020)  All Stars (2020)  Age (2021)  W (2021)  L (2021)  Pct (2021)  Attend. (2021)  Income (2021)  All Stars (2021)
0      Atlanta Hawks        27.9        43        39       0.524          654306             22               1.0        25.4        24  ...          545453             36               1.0        25.4        41        31       0.569         59288.0             37               0.0
1     Boston Celtics        25.9        53        29       0.646          760690             85               1.0        24.7        55  ...          610864             86               2.0        25.1        36        36       0.500         30067.0             46               2.0
2      Brooklyn Nets        26.0        20        62       0.244          632608             52               0.0        25.1        28  ...          524907             44               0.0        28.2        48        24       0.667         30491.0            -80               2.0
3  Charlotte Hornets        26.5        36        46       0.439          710643             21               1.0        26.6        36  ...          478591             36               0.0        24.6        33        39       0.458         68255.0             34               0.0
4      Chicago Bulls        26.9        41        41       0.500          888882             95               1.0        24.4        27  ...          639352            115               0.0        25.6        31        41       0.431         13655.0             39               1.0

5 rows × 36 columns

Attendance versus Wins¶

In [4]:
x = NBA_df['Team']
y1 = NBA_df['Pct (2017)']
y2 = NBA_df['Attend. (2017)']

#figure(figsize=(300, 100), dpi=20)
plt.rc('font', size=10)

# Create the bar graph
fig, ax1 = plt.subplots()
ax1.bar(x, y1, color='y')
ax1.set_xlabel('Teams')
ax1.set_ylabel('Win %', color='k')

# Create the second set of y-axis labels
ax2 = ax1.twinx()
ax2.stem(x, y2,use_line_collection=True)
ax2.set_ylabel('Attendance', color='k')

# Rotate the team names so they do not overlap
ax1.tick_params(axis='x', labelrotation=90)

# Save plot in outputs folder
plt.savefig("Output/Attd0.png", dpi=300, bbox_inches='tight')

# Show the plot
plt.show()
In [5]:
x = NBA_df['Team']
y1 = NBA_df['Pct (2019)']
y2 = NBA_df['Attend. (2019)']

#figure(figsize=(300, 100), dpi=20)
plt.rc('font', size=10)

# Create the bar graph
fig, ax1 = plt.subplots()
ax1.bar(x, y1, color='y')
ax1.set_xlabel('Teams')
ax1.set_ylabel('Win %', color='k')

# Create the second set of y-axis labels
ax2 = ax1.twinx()
ax2.stem(x, y2,use_line_collection=True)
ax2.set_ylabel('Attendance', color='k')

# Rotate the team names so they do not overlap
ax1.tick_params(axis='x', labelrotation=90)

# Save plot in outputs folder
plt.savefig("Output/Attd1.png", dpi=300, bbox_inches='tight')

# Show the plot
plt.show()

Age versus Wins¶

In [6]:
sns.set_style("whitegrid")

sns.lmplot(x="Age (2021)", y="Pct (2021)", data=NBA_df, line_kws={"color":"red"})
plt.xlabel("Average Team Age (2021)", fontweight='bold')
plt.ylabel("Team Win Percentage (2021)", fontweight='bold')

plt.savefig("Output/Age0.png", dpi=300, bbox_inches='tight')

plt.show()
In [7]:
sns.set_style("whitegrid")

sns.lmplot(x="Age (2020)", y="Pct (2020)", data=NBA_df, line_kws={"color":"orange"})
plt.xlabel("Average Team Age (2020)", fontweight='bold')
plt.ylabel("Team Win Percentage (2020)", fontweight='bold')

plt.savefig("Output/Age1.png", dpi=300, bbox_inches='tight')


plt.show()
In [8]:
sns.set_style("whitegrid")

sns.lmplot(x="Age(2019)", y="Pct (2019)", data=NBA_df, line_kws={"color":"blue"})
plt.xlabel("Average Team Age (2019)", fontweight='bold')
plt.ylabel("Team Win Percentage (2019)", fontweight='bold')

plt.savefig("Output/Age2.png", dpi=300, bbox_inches='tight')

plt.show()
In [9]:
sns.set_style("whitegrid")

sns.lmplot(x="Age (2018)", y="Pct (2018)", data=NBA_df, line_kws={"color":"purple"})
plt.xlabel("Average Team Age (2018)", fontweight='bold')
plt.ylabel("Team Win Percentage (2018)", fontweight='bold')

plt.savefig("Output/Age3.png", dpi=300, bbox_inches='tight')

plt.show()
In [10]:
sns.set_style("whitegrid")

sns.lmplot(x="Age (2017)", y="Pct (2017)", data=NBA_df, line_kws={"color":"green"})
plt.xlabel("Average Team Age (2017)", fontweight='bold')
plt.ylabel("Team Win Percentage (2017)", fontweight='bold')

plt.savefig("Output/Age4.png", dpi=300, bbox_inches='tight')

plt.show()

Income and Payroll versus Wins¶

In [15]:
# Show scatter chart
show(nba_scatter)
In [16]:
# Print the r-value for each year
print(f'The r-value is: {rvalue}.')
print(f'The 2017 r-value is: {rvalue1}.')
print(f'The 2018 r-value is: {rvalue2}.')
print(f'The 2019 r-value is: {rvalue3}.')
print(f'The 2020 r-value is: {rvalue4}.')
print(f'The 2021 r-value is: {rvalue5}.')
The r-value is: 0.25013462921759505.
The 2017 r-value is: 0.34129414919782575.
The 2018 r-value is: 0.6360615671322686.
The 2019 r-value is: 0.3987123814842942.
The 2020 r-value is: 0.4003994838684177.
The 2021 r-value is: 0.4203519954833007.
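The cells that compute these r-values are not shown in this export. As a minimal sketch, one r-value per season could be obtained with scipy.stats.linregress, which is imported above. The toy numbers and the choice of payroll against win percentage are assumptions for illustration only, not the notebook's data.

```python
from scipy.stats import linregress

# Hypothetical toy data: payroll (in millions) and win percentage for
# four teams in two seasons. These numbers are made up for illustration.
payroll_by_year = {
    2017: [95.0, 102.0, 110.0, 120.0],
    2018: [98.0, 105.0, 115.0, 125.0],
}
pct_by_year = {
    2017: [0.40, 0.45, 0.55, 0.60],
    2018: [0.35, 0.50, 0.48, 0.62],
}

r_values = {}
for year in payroll_by_year:
    # linregress returns slope, intercept, rvalue, pvalue, and stderr
    result = linregress(payroll_by_year[year], pct_by_year[year])
    r_values[year] = result.rvalue
    print(f"The {year} r-value is: {result.rvalue:.3f}.")
```

Keeping the results in a dictionary keyed by year avoids one variable per season (rvalue1, rvalue2, and so on).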
In [19]:
# Show chart
show(nba)

Media Market Versus Wins¶

  • We seek to determine whether the size of a franchise's media market has any impact on its success. We define success as either a winning record or profitability.

  • Maps are created to visualize the data, and statistical tests are run to draw conclusions.

  • Finally, a correlation heatmap is created with all variables of interest in the study. A summary and final thoughts are presented.

Data¶

  • The data consists of the main dataset compiled by our team as well as media market data obtained from hoop-social.com. An API search is also used to pull the geocoordinates of each team's arena.

  • Additional variables are calculated for use in summarizing our findings.

Import Additional Files and Merge¶

In [21]:
# Create a function to remove punctuation from columns in a dataframe
def remove_punctuation(input_string):
    # Make a translation table to remove all punctuation characters
    translator = str.maketrans('', '', string.punctuation)
    
    # Use translate method to remove all punctuation characters
    no_punct = input_string.translate(translator)
    
    return no_punct
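
As a quick check of what this helper does, the sketch below applies it to a few arena names written with punctuation. The input strings are assumptions chosen for illustration; the cell that actually calls the helper is not shown in this export.

```python
import string

def remove_punctuation(input_string):
    # Delete every character listed in string.punctuation
    translator = str.maketrans('', '', string.punctuation)
    return input_string.translate(translator)

# Hypothetical inputs, chosen for illustration
arena_names = ["Crypto.com Arena", "AT&T Center", "TD Garden"]
cleaned = [remove_punctuation(name) for name in arena_names]
print(cleaned)
```

On a DataFrame column, the same function can be applied element-wise with Series.apply, for example full_NBA_df["Arena"].apply(remove_punctuation).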

Use API Search to Find Arena Coordinates¶

In [24]:
# Pull arena latitude and longitude from the Geoapify geocoding API
base_url = "https://api.geoapify.com/v1/geocode/search"
for index, row in full_NBA_df.iterrows():
    arena = row["Arena"]
    # Pass the query as params so arena names with spaces are URL-encoded
    params = {"text": arena, "format": "json", "apiKey": geoapify_key}
    geo_data = requests.get(base_url, params=params).json()
    try:
        full_NBA_df.loc[index, "Arena Lat"] = geo_data["results"][0]["lat"]
        full_NBA_df.loc[index, "Arena Lon"] = geo_data["results"][0]["lon"]
        print(f"Coordinates found for {arena}")
    except (KeyError, IndexError):
        # No usable result came back for this arena
        print(f"Could not find coordinates for {arena}")
Coordinates found for State Farm Arena
Coordinates found for TD Garden
Coordinates found for Barclays Center
Coordinates found for Spectrum Center
Coordinates found for United Center
Coordinates found for Rocket Mortgage Fieldhouse
Coordinates found for American Airlines Center
Coordinates found for Ball Arena
Coordinates found for Little Caesars Arena
Coordinates found for Chase Center
Coordinates found for Toyota Center
Coordinates found for Gainbridge Fieldhouse
Coordinates found for Cryptocom Arena
Coordinates found for Cryptocom Arena
Coordinates found for FedEx Forum
Coordinates found for FTX Arena
Coordinates found for Fiserv Forum
Coordinates found for Target Center
Coordinates found for Smoothie King Center
Coordinates found for Madison Square Garden IV
Coordinates found for Paycom Center
Coordinates found for Amway Center
Coordinates found for Wells Fargo Center
Coordinates found for Phoenix Suns Arena
Coordinates found for Moda Center
Coordinates found for Golden 1 Center
Coordinates found for ATT Center
Coordinates found for Scotiabank Arena
Coordinates found for Vivint Smart Home Arena
Coordinates found for Capital One Arena
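
The loop above relies on the response containing a non-empty results list whose first entry has lat and lon keys. A minimal sketch of that lookup as a standalone helper, so that the failure case can be exercised without calling the API, might look like this. The function name extract_coords and the sample payloads are hypothetical; only the results, lat, and lon keys come from the loop.

```python
def extract_coords(geo_data):
    """Return (lat, lon) from the first result, or None if there is none."""
    try:
        first = geo_data["results"][0]
        return first["lat"], first["lon"]
    except (KeyError, IndexError):
        return None

# Hypothetical payloads shaped like the response the loop indexes into
found = {"results": [{"lat": 33.757, "lon": -84.396}]}
empty = {"results": []}

print(extract_coords(found))  # (33.757, -84.396)
print(extract_coords(empty))  # None
```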

Calculate 5-year Statistics and Create New Columns¶

In [26]:
# Add column for average win percentage

Pct_df = NBA_df[["Team", "Pct (2017)", "Pct (2018)", "Pct (2019)", "Pct (2020)", "Pct (2021)"]]
full_NBA_df['mean Pct'] = Pct_df.iloc[:, 1:].mean(axis=1)

# Add column for average age

Age_df = NBA_df[["Team", "Age (2017)", "Age (2018)", "Age(2019)", "Age (2020)", "Age (2021)"]]
full_NBA_df['mean Age'] = Age_df.iloc[:, 1:].mean(axis=1)

# Add column for average income

inc_df = NBA_df[["Team", "Income (2017)", "Income (2018)", "Income (2019)", "Income (2020)", "Income (2021)"]]
full_NBA_df['mean Income'] = inc_df.iloc[:, 1:].mean(axis=1)

# Add column for average attendance

Att_df = NBA_df[["Team", "Attend. (2017)", "Attend. (2018)", "Attend. (2019)", "Attend. (2020)"]]
full_NBA_df['mean Attendance'] = Att_df.iloc[:, 1:].mean(axis=1)

# Add column for average payroll

pay_df = full_NBA_df[["Team", "Payroll (2017)", "Payroll (2018)", "Payroll (2019)", "Payroll (2020)","Payroll (2021)"]]
full_NBA_df['mean Payroll'] = pay_df.iloc[:, 1:].mean(axis=1)


# Add column for average rank

rk_df = full_NBA_df[["Team", "Rk (2017)", "Rk (2018)", "Rk (2019)", "Rk (2020)", "Rk (2021)"]]
full_NBA_df['mean Rank'] = rk_df.iloc[:, 1:].mean(axis=1)

# Calculate total wins over the five year sample period

wins_df = full_NBA_df[["Team", "W (2017)", "W (2018)", "W (2019)", "W (2020)", "W (2021)"]]
full_NBA_df['Total Wins'] = wins_df.iloc[:, 1:].sum(axis=1)
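
Each of the averages above selects its year-suffixed columns by hand. As a sketch of the same idea in a more compact form, the columns for one statistic can be selected by name pattern and averaged row-wise. The helper name mean_over_years and the toy frame are hypothetical; the optional space in the pattern is there because the notebook uses both the "Age (2018)" and "Age(2019)" spellings.

```python
import re
import pandas as pd

def mean_over_years(df, stat):
    """Row-wise mean of every column named '<stat> (YYYY)' or '<stat>(YYYY)'."""
    pattern = rf"^{re.escape(stat)} ?\(\d{{4}}\)$"
    return df.filter(regex=pattern).mean(axis=1)

# Hypothetical toy frame with the same column naming scheme
toy = pd.DataFrame({
    "Team": ["A", "B"],
    "Age (2018)": [25.0, 27.0],
    "Age(2019)": [26.0, 28.0],
    "Pct (2018)": [0.500, 0.600],
})
toy["mean Age"] = mean_over_years(toy, "Age")
print(toy[["Team", "mean Age"]])
```

Note that a pattern like this averages every matching year, whereas the attendance average above lists only 2017 through 2020, so the explicit column lists remain the safer choice when a year must be left out.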

Create Categorical Variables¶

In [27]:
# Create Categories for metro size

bins = [0, 2500000, 5000000, 20000000]
labels = ['Small(<2.5M)', 'Medium(2.5M-5M)', 'Large(>5M)']

full_NBA_df['Metro Categories'] = pd.cut(full_NBA_df['Metro Population'], bins=bins, labels=labels, right=False)


# Create Categories for win count

bins = [130, 190, 250]
labels = ['Below Average', 'Above Average']

full_NBA_df['Win Categories'] = pd.cut(full_NBA_df['Total Wins'], bins=bins, labels=labels, right=False)

# Create Categories for income

bins = [10, 43, 150]
labels = ['Below Median', 'Above Median']

full_NBA_df['Income Categories'] = pd.cut(full_NBA_df['mean Income'], bins=bins, labels=labels, right=False)
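
Because right=False is passed, each bin includes its left edge and excludes its right edge. The sketch below shows what that means at the boundaries, using the metro-size bins from the cell above on made-up population values.

```python
import pandas as pd

bins = [0, 2500000, 5000000, 20000000]
labels = ['Small(<2.5M)', 'Medium(2.5M-5M)', 'Large(>5M)']

# Made-up populations that sit on or next to the bin edges
populations = pd.Series([2499999, 2500000, 5000000, 20000000])
categories = pd.cut(populations, bins=bins, labels=labels, right=False)
print(categories.tolist())
```

A value equal to the last edge (20,000,000 here) falls outside every bin and becomes NaN, so the upper edge needs to sit above the largest value in the data.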

Mapping Values¶

In [28]:
# Configure the map plot showing metro pop by size and wins by color
wins_map = full_NBA_df.hvplot.points(
    "Arena Lon",
    "Arena Lat",
    geo = True,
    tiles = "CartoLight",
    frame_width = 700,
    frame_height = 500,
    size = "Metro Population",
    scale = 0.01,
    color = "Total Wins",
    hover_cols = ["Team"],
    clabel = 'Total Win Count',
    title = 'NBA Teams by Win Count and Market Size'
)

# Save to output folder
hvplot.save(wins_map, 'Output/NBAmapWins.html')
In [29]:
# Show map
wins_map
Out[29]:
In [30]:
# Configure the map plot showing metro pop by size and mean income by color
income_map = full_NBA_df.hvplot.points(
    "Arena Lon",
    "Arena Lat",
    geo = True,
    tiles = "CartoLight",
    frame_width = 700,
    frame_height = 500,
    size = "Metro Population",
    scale = 0.008,
    color = "mean Income",
    hover_cols = ["Team"],
    clabel = 'Mean Income (Millions)',
    title = 'NBA Teams by Mean Income and Market Size'
)


# Save to output folder
hvplot.save(income_map, 'Output/NBAmapIncome.html')
In [31]:
# Show map
income_map
Out[31]:

Map Interpretation¶

  • At first glance, the first map suggests that a franchise's number of wins is not correlated with its market size.

  • In the second map, a franchise's mean income does appear to be correlated with its market size.

Chi-Square Analysis¶

In [32]:
from scipy.stats import chi2_contingency

# Chi Square for Metro Size versus Income
# Create a contingency table
contingency_table = pd.crosstab(full_NBA_df['Metro Categories'], full_NBA_df['Income Categories'])
print(contingency_table)
# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("----------------------------------------------------")
# Check the p-value to determine the significance of the test
if p < 0.05:
    print("Reject the null hypothesis - The variables are dependent")
else:
    print("Fail to reject the null hypothesis - The variables are independent")
print(f"The p-value is {p}.")
Income Categories  Below Median  Above Median
Metro Categories                             
Small(<2.5M)                  7             2
Medium(2.5M-5M)               5             4
Large(>5M)                    2            10
----------------------------------------------------
Reject the null hypothesis - The variables are dependent
The p-value is 0.017205950425851393.
In [33]:
# Chi Square for Metro Size versus Wins
# Create a contingency table
contingency_table = pd.crosstab(full_NBA_df['Metro Categories'], full_NBA_df['Win Categories'])
print(contingency_table)
# Perform the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)
print("----------------------------------------------------")
# Check the p-value to determine the significance of the test
if p < 0.05:
    print("Reject the null hypothesis - The variables are dependent")
else:
    print("Fail to reject the null hypothesis - The variables are independent")
print(f"The p-value is {p}.")
Win Categories    Below Average  Above Average
Metro Categories                              
Small(<2.5M)                  4              5
Medium(2.5M-5M)               5              4
Large(>5M)                    6              6
----------------------------------------------------
Fail to reject the null hypothesis - The variables are independent
The p-value is 0.8948393168143698.

Chi-Square Results¶

  • The variables "Metro Population", "mean Income", and "Total Wins" were discretized into categories using the pd.cut method, and chi-square tests were run to check our initial observations.

  • For market size and total wins, we fail to reject the null hypothesis of independence (p ≈ 0.89). We therefore found no evidence that the size of a market affects a team's ability to win.

  • However, market size and mean income are dependent (p ≈ 0.017). Franchises in larger media markets tend to be more profitable.
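
As a check on these results, the first test can be reproduced directly from the printed contingency table, and the expected counts inspected. This is a sketch rather than part of the original analysis; the rule of thumb that expected counts should be at least 5 is an added consideration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the metro size versus income table above
# rows: Small, Medium, Large; columns: Below Median, Above Median
observed = np.array([[7, 2],
                     [5, 4],
                     [2, 10]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
print(expected)

# Count cells whose expected frequency is below 5
small_cells = int((expected < 5).sum())
print(f"{small_cells} of {expected.size} expected counts are below 5")
```

With several expected counts below 5 in a sample of 30 teams, the chi-square approximation is rough, so the p-values are best read as indicative.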

Conclusion¶

Create Correlation Heatmap with Seaborn¶

In [34]:
# Create new df with all variables of interest
avgs_df = full_NBA_df[["Total Wins", 'mean Attendance','mean Payroll','mean Income','Metro Population','mean Age']]

# Create a mask for the upper triangle so that values are only shown once
# Use the built-in bool; the np.bool alias was removed in NumPy 1.24
mask = np.zeros_like(avgs_df.corr(), dtype=bool)
mask[np.triu_indices_from(mask,k=1)] = True

# Create heatmap
sns.heatmap(avgs_df.corr(), annot=True, cmap='coolwarm',mask=mask)

# Save to output folder
plt.savefig("Output/NBAheatmap.png", dpi=300, bbox_inches='tight')

Summary of Findings¶

Our correlation matrix displays how all variables of interest are related to each other.

  • Attendance at games is only weakly related to a team's ability to win.

    • r = 0.25
  • Player age has a positive, moderate relationship with a team's ability to win. More aggregate experience among players appears to be associated with success.

    • r = 0.56
  • A positive, moderate relationship exists for payroll and team wins. Better players will cost a premium.

    • r = 0.52
  • A team's income, however, has little to do with a winning record.

    • r = 0.017
  • A team's market size also has little to do with a team's ability to win. In fact, many teams in smaller metro areas have top-tier records. The relationship is negative but weak.

    • r = -0.2
  • The market size of a team is related to a franchise's ability to generate income. Teams in larger metropolitan areas are more successful from a financial standpoint.

    • r = 0.52

Final Thoughts¶

Experience seems to be an important factor in an NBA team's ability to succeed, as the relationship between wins and mean age was the strongest in our study. This experience, though, comes at a cost, since high player payrolls are also correlated with wins.

From a financial perspective, a winning team does not necessarily translate into large profits. Market size and average attendance do more to predict a larger income than a team's win record does. For an owner, the ability to generate income depends more on marketability.